Adding 16bpp pixels using MMX
Last time I presented a reasonably fast way of adding 16bpp (bits per pixel) pixels with the integer unit: "adding with saturation" took 13.5 clocks per pixel and "pixel average" took 5 clocks per pixel.
Nowadays all (new) computers have processors supporting the MMX instruction set. What can one do with it? Let's see.
The workaround
BRAD presented a better way of adding 16bpp pixels with the integer unit. Since MMX registers are 64 bits wide, we can take another step forward and operate on four 16-bit pixels at once, which doubles the throughput of the algorithm.
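To make the idea concrete, here is a minimal sketch (not taken from the routines below) of what "four pixels at once" means: paddw treats an MMX register as four independent 16-bit words, one 5:6:5 pixel per word.

        movq    mm0, [esi]      ; four source pixels (8 bytes)
        movq    mm1, [edi]      ; four destination pixels
        paddw   mm0, mm1        ; four independent 16-bit additions in one go
                                ; (carries still cross the R/G/B borders inside
                                ;  each pixel - that is what the masks below fix)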
Routines
The first routine does "adding with saturation" and the other "pixel average". Both routines read pixels from the buffer/stream at [esi], mix them with the pixels at [edi] and store the results back at [edi]. An MMX peculiarity, the lack of immediate operands, forces us to keep the masks either in memory or in unused MMX registers. It is better to keep the constants in registers, since instructions that access memory can pair only in the U-pipe.
As before, to the right of each instruction I marked the cycle in which it is issued. '-' means that the instruction pairs in the V-pipe, i.e. it is issued in the same cycle as the previous instruction. Note that this pairing applies only to the Pentium MMX.
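The routines reference their masks through memory labels. As a sketch (the label names follow the code below; the dq syntax and the 5:6:5 bit positions are my assumptions, not part of the original), the constants could be declared like this:

_7BEF_7BEF_7BEF_7BEFh   dq  07BEF7BEF7BEF7BEFh  ; not-MSB mask: everything but bits 15, 10, 4 of each pixel
_8410_8410_8410_8410h   dq  08410841084108410h  ; MSB mask: bits 15, 10, 4 (top bit of R, G, B)
_0821_0821_0821_0821h   dq  00821082108210821h  ; LSB mask: bits 11, 5, 0 (bottom bit of R, G, B)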
; Load unused registers with constants (before the loop)
        movq    mm6, [_7BEF_7BEF_7BEF_7BEFh]    ; not-MSB mask
        movq    mm7, [_8410_8410_8410_8410h]    ; MSB mask

; Load 4 pixels from each buffer
        movq    mm0, [esi]      ; 1
        movq    mm1, [edi]      ; 2
        movq    mm2, mm0        ; -
        movq    mm3, mm1        ; 3

; Isolate MSBs and the rest of bits
        pand    mm0, mm6        ; -    mm6 == not-MSB mask
        pand    mm1, mm6        ; 4
        pand    mm2, mm7        ; -    mm7 == MSB mask
        pand    mm3, mm7        ; 5

; Do the addition
        paddw   mm0, mm1        ; -

; Create mask
        movq    mm1, mm2        ; 6
        por     mm2, mm3        ; -    mm2 = MSB1 | MSB2
        pand    mm1, mm3        ; 7    mm1 = MSB1 & MSB2
        movq    mm3, mm2        ; -
        pand    mm3, mm0        ; 8    mm3 = (MSB1|MSB2) & newMSB

; Compose result with MSBs
        por     mm0, mm2        ; -

; Finish mask creation
        por     mm1, mm3        ; 9
        psrlw   mm1, 4          ; 10
        paddw   mm1, mm6        ; 11   mm6 == not-MSB mask
        pxor    mm1, mm6        ; 12

; Store result ORed with mask
        por     mm0, mm1        ; 13
        movq    [edi], mm0      ; 15   (was pipeline stall)

And pixel average:
; Load unused registers with constants (before the loop)
        movq    mm6, [_7BEF_7BEF_7BEF_7BEFh]    ; not-MSB mask
        movq    mm7, [_0821_0821_0821_0821h]    ; LSB mask

; Load 4 pixels from each buffer
        movq    mm0, [esi]      ; 1
        movq    mm1, [edi]      ; 2

; Create carry mask of LSB addition
        movq    mm2, mm0        ; -
        pand    mm2, mm1        ; 3
        psrlq   mm0, 1          ; -
        pand    mm2, mm7        ; 4    mm7 == LSB mask

; Calculate average
        psrlq   mm1, 1          ; -
        pand    mm0, mm6        ; 5    mm6 == not-MSB mask
        pand    mm1, mm6        ; -
        paddw   mm0, mm1        ; 6
        paddw   mm0, mm2        ; 7

; Store result
        movq    [edi], mm0      ; 9    (was pipeline stall)
Now we see that we can do adding with saturation in 15 cycles and pixel average in 9 cycles. Both routines process four pixels per pass, so the first takes 3.75 cycles per pixel and the second 2.25 cycles per pixel. Compare this to BRAD's integer routines! And the speed-up over my own integer routines from Hugi #17 is even greater!
There is also the matter of cache misses. As the routines keep touching new pixels, they often incur cache-miss penalties. If we use the prefetch instruction (introduced with the Pentium III and the K6-2), the routines will really run as fast as estimated.
Both the prefetch and the loop management can be squeezed into the remaining gaps in the pipeline, so the final routines take only about two clocks more per pass (an extra 0.5 clocks per pixel).
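For illustration, a complete inner loop around the pixel average body could look like the sketch below. The register usage (ecx counting 4-pixel groups) and the prefetch distance are assumptions of mine, not part of the original routines; prefetchnta is the SSE spelling (Pentium III), while the K6-2 uses the 3DNow! prefetch instruction instead. Also remember to execute emms after the loop, before any FPU code runs.

avg_loop:
        prefetchnta [esi+64]    ; hint: pull a future source cache line in early
        ; ... the pixel average body from above (movq/pand/psrlq/paddw) ...
        add     esi, 8          ; 4 pixels = 8 bytes
        add     edi, 8
        dec     ecx             ; ecx = number of 4-pixel groups (assumption)
        jnz     avg_loop
        emms                    ; leave the MMX state before using the FPU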
How it works
If you are interested in a detailed description of the algorithm, refer to BRAD's article - he is the author of the original method.
Unlike BRAD, I reused the not-MSB mask to expand the isolated carry (MSB) bits into the final mask, and thus reduced the number of constants needed. Does it work? The table below shows, step by step, what happens to the isolated MSB.
---------------------------------------------
|     red and blue    |        green        |
---------------------------------------------
| no carry |  carry   | no carry |  carry   |
---------------------------------------------
|  00000b  |  10000b  | 000000b  | 100000b  |
---------------------------------------------
|  shift right by 4   |  shift right by 4   |
---------------------------------------------
|  00000b  |  00001b  | 000000b  | 000010b  |
---------------------------------------------
|     add 01111b      |     add 011111b     |
---------------------------------------------
|  01111b  |  10000b  | 011111b  | 100001b  |
---------------------------------------------
|     xor 01111b      |     xor 011111b     |
---------------------------------------------
|  00000b  |  11111b  | 000000b  | 111110b  |
---------------------------------------------
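To complete the picture: in the last step the mask is simply ORed into the sum. A worked trace for a single red/blue field (the input values are made up for illustration):

;   inputs              11000b + 01100b            (24 + 12, true sum 36 > 31)
;   add without MSBs    01000b + 01100b = 10100b   (the carry reached the MSB position)
;   overflow test       an input MSB set AND the new MSB set -> field flagged
;   mask (see table)    11111b
;   final por           10100b | 11111b = 11111b   (clamped to the maximum)
; A field that did not overflow gets the mask 00000b and is left unchanged by the OR.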
If you are not familiar with MMX, refer to the Intel manuals.
Credits
Special thanks to Rawhed - he wrote the original article on the topic. Also thanks to TAD; if it were not for him, I would not have written about MMX at all. Further thanks to BRAD for his better algorithm.